Automatic Detection of Text Genre
Abstract
As the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. We propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties.
1 Introduction
Computational linguists have been concerned for the most part with two aspects of texts: their structure and their content. That is, we consider texts on the one hand as formal objects, and on the other as symbols with semantic or referential values. In this paper we want to consider texts from the point of view of genre: that is, according to the various functional roles they play.
Genre is necessarily a heterogeneous classificatory principle, which is based among other things on the way a text was created, the way it is distributed, the register of language it uses, and the kind of audience it is addressed to. For all its complexity, this attribute can be extremely important for many of the core problems that computational linguists are concerned with. Parsing accuracy could be increased by taking genre into account (for example, certain object-less constructions occur only in recipes in English). The same holds for POS tagging (the frequency of uses of trend as a verb in the Journal of Commerce is 35 times higher than in Sociological Abstracts). In word-sense disambiguation, many senses are largely restricted to texts of a particular style, such as colloquial or formal (for example, the word pretty is far more likely to have the meaning "rather" in informal genres than in formal ones). In information retrieval, genre classification could enable users to sort search results according to their immediate interests.
People who go into a bookstore or library are not usually looking simply for information about a particular topic, but rather have requirements of genre as well: they are looking for scholarly articles about hypnotism, novels about the French Revolution, editorials about the supercollider, and so forth.
If genre classification is so useful, why hasn't it figured much in computational linguistics before now? One important reason is that, up to now, the digitized corpora and collections which are the subject of much CL research have been for the most part generically homogeneous (i.e., collections of scientific abstracts or newspaper articles, encyclopedias, and so on), so that the problem of genre identification could be set aside. To a large extent, the problems of genre classification don't become salient until we are confronted with large and heterogeneous search domains like the World-Wide Web.
Another reason for the neglect of genre, though, is that it can be a difficult notion to get a conceptual handle on, particularly in contrast with properties of structure or topicality, which for all their complications involve well-explored territory. In order to do systematic work on automatic genre classification, by contrast, we require the answers to some basic theoretical and methodological questions. Is genre a single property or attribute that can be neatly laid out in some hierarchical structure? Or are we really talking about a multidimensional space of properties that have little more in common than that they are more or less orthogonal to topicality? And once we have the theoretical prerequisites in place, we have to ask whether genre can be reliably identified by means of computationally tractable cues.
In a broad sense, the word "genre" is merely a literary substitute for "kind of text," and discussions of literary classification stretch back to Aris-